noncebalancer: use endpointsharding, ignore ready status #8679

Merged
jsha merged 5 commits into main from noncebalancer-endpointsharding
Mar 19, 2026

@jsha jsha commented Mar 14, 2026

The old noncebalancer only saw READY SubConns, which was a problem during the brief periods when a SubConn was reconnecting (for instance due to a GOAWAY from the server), since nonce redemption requests are not fungible between backends. Unfortunately, READY SubConns are all that the balancer interface provides. And we can't get that interface to pass non-READY SubConns to our picker without reimplementing or copying all its SubConn management logic.

Luckily, grpc provides the endpointsharding balancer implementation that does exactly what we want. It maintains a collection of child balancers each owning a single endpoint (note: for our setup an endpoint is equivalent to a single address, though it can be one-to-many). It also lets us query the state of each child, including the endpoint it's responsible for.

This allows us to construct a picker that is aware of all available backends, even those that aren't currently READY. That, in turn, prevents us from temporarily serving errors while a given nonce redemption backend is reconnecting.
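As a rough illustration of the difference, here is a minimal, self-contained sketch. It does not use the real gRPC types: `connState`, `childState`, `pickReadyOnly`, `pickAllChildren`, and both error values are hypothetical stand-ins, and the addresses are made up. The "queue" error stands in for the sentinel a real picker can return to make gRPC hold the RPC until a new picker arrives.

```go
package main

import (
	"errors"
	"fmt"
)

// connState is a simplified, hypothetical stand-in for gRPC connectivity states.
type connState int

const (
	stateReady connState = iota
	stateConnecting
)

// childState is a hypothetical stand-in for the per-endpoint state that a
// parent balancer exposes for each of its children.
type childState struct {
	addr  string
	state connState
}

var (
	// errUnknownBackend models a hard failure: the nonce cannot be redeemed.
	errUnknownBackend = errors.New("no such nonce backend")
	// errQueueRPC models "hold the RPC until the backend is usable again".
	errQueueRPC = errors.New("backend not READY yet; queue the RPC")
)

// pickReadyOnly models a picker that only ever learns about READY backends,
// so a reconnecting backend is indistinguishable from a nonexistent one.
func pickReadyOnly(children []childState, want string) (string, error) {
	for _, c := range children {
		if c.state == stateReady && c.addr == want {
			return c.addr, nil
		}
	}
	return "", errUnknownBackend
}

// pickAllChildren models a picker that knows every backend. A known backend
// that is not READY yields a retryable "queue" result instead of an error.
func pickAllChildren(children []childState, want string) (string, error) {
	for _, c := range children {
		if c.addr != want {
			continue
		}
		if c.state == stateReady {
			return c.addr, nil
		}
		return "", errQueueRPC
	}
	return "", errUnknownBackend
}

func main() {
	children := []childState{
		{addr: "10.0.0.1:9101", state: stateReady},
		{addr: "10.0.0.2:9101", state: stateConnecting},
	}
	_, errOld := pickReadyOnly(children, "10.0.0.2:9101")
	_, errNew := pickAllChildren(children, "10.0.0.2:9101")
	fmt.Println("ready-only picker:", errOld)
	fmt.Println("all-children picker:", errNew)
}
```

In this model the only behavioral change is for a backend that exists but is reconnecting: the first picker reports it as missing, while the second asks the caller to wait.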

To see another example of endpointsharding in use, see the customroundrobin implementation.

For more context on how endpointsharding came to be implemented, see gRFC A61: IPv4 and IPv6 Dualstack Backend Support.

If you're curious how endpointsharding passes around the information about non-READY SubConns, it uses a type assertion from a balancer.Picker to its internal type.
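The type-assertion pattern can be sketched in isolation as follows. This is only a model of the idea, not the library's code: `picker`, `shardingPicker`, `childInfo`, `childrenFromPicker`, and `otherPicker` are hypothetical names, and the `Pick` signature is simplified.

```go
package main

import "fmt"

// picker is a simplified, hypothetical stand-in for a balancer picker interface.
type picker interface {
	Pick(target string) (string, error)
}

// childInfo is a hypothetical per-child record carried alongside the picker.
type childInfo struct {
	addr  string
	ready bool
}

// shardingPicker models the parent's private picker type. It satisfies the
// public interface but also carries the full list of children.
type shardingPicker struct {
	children []childInfo
}

func (p *shardingPicker) Pick(target string) (string, error) {
	for _, c := range p.children {
		if c.addr == target && c.ready {
			return c.addr, nil
		}
	}
	return "", fmt.Errorf("no READY child for %q", target)
}

// childrenFromPicker recovers the child list by asserting the interface value
// back to the private concrete type. Any other picker yields nil.
func childrenFromPicker(p picker) []childInfo {
	sp, ok := p.(*shardingPicker)
	if !ok {
		return nil
	}
	return sp.children
}

// otherPicker is an unrelated implementation, used to show the fallback path.
type otherPicker struct{}

func (otherPicker) Pick(string) (string, error) { return "", nil }

func main() {
	var p picker = &shardingPicker{children: []childInfo{
		{addr: "10.0.0.1:9101", ready: true},
		{addr: "10.0.0.2:9101", ready: false},
	}}
	for _, c := range childrenFromPicker(p) {
		fmt.Printf("%s ready=%v\n", c.addr, c.ready)
	}
	fmt.Println("children from unrelated picker:", len(childrenFromPicker(otherPicker{})))
}
```

The appeal of this design is that the public interface stays unchanged: code that only knows about `picker` keeps working, while code that knows about the accessor can get at the extra state.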

Alternative to #8672. Fixes #8662.

This edits noncebalancer.go in place for ease of diffing, and also copies the original grpc/noncebalancer (with no edits) to grpc/noncebalancerv1. But don't take my word for it:

diff <(git show origin/main:grpc/noncebalancer/noncebalancer.go) grpc/noncebalancerv1/noncebalancer.go
diff <(git show origin/main:grpc/noncebalancer/noncebalancer_test.go) grpc/noncebalancerv1/noncebalancer_test.go

The old noncebalancer only saw READY SubConns, which was a problem during the
brief periods when a SubConn needed to reconnect (for instance due to a GOAWAY
from the server). Unfortunately, that's all the balancer interface provides.
And we can't get it to pass non-READY SubConns to our picker without
reimplementing or copying all its SubConn management logic.

Luckily, grpc provides the [`endpointsharding`] balancer implementation
that does exactly what we want.  It maintains a collection of child
balancers each owning a single endpoint (note: for our purposes an
endpoint is equivalent to a single address, though it can be one-to-many).
It also lets us query the [state] of each child, including the
endpoint it's responsible for.

This allows us to construct a picker that is aware of all available backends,
even those that aren't currently READY. That, in turn, prevents us from
temporarily serving errors while a given nonce redemption backend reconnects.

To see an example of `endpointsharding` in use, see the [`customroundrobin`]
implementation.

For more context on how `endpointsharding` came to be implemented, see
[gRFC A61: IPv4 and IPv6 Dualstack Backend Support][a61].

[`endpointsharding`]: https://pkg.go.dev/google.golang.org/grpc/balancer/endpointsharding
[state]: https://pkg.go.dev/google.golang.org/grpc/balancer/endpointsharding#ChildState
[a61]: https://github.com/grpc/proposal/blob/master/A61-IPv4-IPv6-dualstack-backends.md
[`customroundrobin`]: https://github.com/grpc/grpc-go/blob/99f36d4a0c28bc967a8d3fe23ebc2a264b322070/examples/features/customloadbalancer/client/customroundrobin/customroundrobin.go
@jsha jsha marked this pull request as ready for review March 16, 2026 16:55
@jsha jsha requested a review from a team as a code owner March 16, 2026 16:55
@jsha jsha requested a review from beautifulentropy March 16, 2026 16:55
@jsha jsha marked this pull request as draft March 17, 2026 17:10

jsha commented Mar 17, 2026

Back in draft because I'm currently implementing the config-based switching between implementations.

jsha added 3 commits March 17, 2026 12:35
Set maxConnectionAge to 1s, and make nonce_test.go collect 300 nonces, then
redeem them one at a time, separated by 10ms. This creates a high likelihood of
a redemption request occurring during a reconnect.
@jsha jsha marked this pull request as ready for review March 17, 2026 19:57

jsha commented Mar 17, 2026

Ready for review. The new noncebalancer can be selected by setting the following in wfe2.json:

			"srvResolver": "nonce-srv-v2",

@beautifulentropy beautifulentropy left a comment

Great work on this! Using endpointsharding is a really clean way to get visibility into non-READY backends without reimplementing SubConn management. I have just one optional comment, let me know what you think.

Comment thread grpc/noncebalancer/noncebalancer.go
@github-actions
@jsha, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

aarongable previously approved these changes Mar 19, 2026

@aarongable aarongable left a comment

LGTM with nits.

Comment thread grpc/internal/resolver/dns/dns_resolver.go Outdated
Comment thread grpc/noncebalancer/noncebalancer.go
Comment thread grpc/noncebalancer/noncebalancer.go Outdated
Comment thread test/config/wfe2.json Outdated
@jsha jsha dismissed stale reviews from aarongable and beautifulentropy via 7c713e8 March 19, 2026 18:44
@beautifulentropy beautifulentropy self-requested a review March 19, 2026 18:58
@jsha jsha merged commit 3f18560 into main Mar 19, 2026
18 checks passed
@jsha jsha deleted the noncebalancer-endpointsharding branch March 19, 2026 21:12
@jsha jsha mentioned this pull request Mar 23, 2026

Development

Successfully merging this pull request may close these issues.

Fix badNonce CI flake

3 participants